Attention heads – side by side
Model: · Layers: 24 · Heads: 14 · Tokens: 35. Hover cells for raw values (weights after softmax).
How to read the numbers
- Row labels = query tokens. Column labels = key tokens from the same sequence (self-attention: each token serves as both a query and a key).
- Each cell (row, col) = attention weight from the row token (query) to the column token (key).
- Range: 0 to 1. For each row, the numbers sum to 1 (they are probabilities after softmax).
- 0.0 = this query token ignores that key token; 1.0 = full focus on that key.
- Row reading: Pick a row (e.g. "Ġfox"). The values show how that token "distributes" its attention over earlier tokens (and itself).
- Upper-right zeros: Causal masking: the token at position i attends only to positions 0…i, so every cell with j > i is 0.
- Diagonal: Often high (self-attention); the token attending to itself.
- Leading space: Tokens like " quick" show a leading space because the tokenizer marks word starts that way (BPE uses Ġ; we display it as a space).
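The properties above (rows summing to 1, zeros above the diagonal) can be sketched with a row-wise causal softmax. This is an illustrative toy, not the tool's actual code; the scores below are made-up pre-softmax logits for a 4-token sequence.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax with a causal mask: cell (i, j) is 0 for j > i."""
    n = len(scores)
    weights = []
    for i, row in enumerate(scores):
        # Mask out future positions (j > i) before the softmax.
        masked = [row[j] if j <= i else float("-inf") for j in range(n)]
        # Subtract the row max for numerical stability.
        m = max(masked[: i + 1])
        exps = [math.exp(s - m) if s != float("-inf") else 0.0 for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw attention scores (logits) for 4 tokens.
scores = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 2.0, 0.0, 0.0],
    [0.1, 0.3, 1.5, 0.0],
    [0.2, 0.2, 0.2, 0.2],
]
w = causal_softmax(scores)
for row in w:
    print([round(v, 3) for v in row])
```

Each printed row sums to 1, and every entry above the diagonal is exactly 0: the first token can only attend to itself (weight 1.0), while the last row, with equal logits, spreads attention uniformly (0.25 each) over all four positions.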